Support DeepSeekV3-style block FP8 quantization with CT #21337
base: main
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request adds support for DeepSeekV3-style block FP8 quantization using compressed-tensors. The changes are extensive, touching several files related to quantization and introducing new logic for handling block-quantized weights, especially in MoE layers. The PR also adds support for new hardware features like DeepGEMM on Blackwell.
While the overall approach seems correct, I've found a critical issue in the control flow of process_weights_after_loading in compressed_tensors_moe.py. The logic for the block-quantized and non-block-quantized paths is mixed, leading to duplicated operations and potential runtime errors. This needs to be refactored to ensure correctness.
Other changes, such as refactoring to decouple layers and using more specific type hints, are good improvements to the codebase.
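For illustration, here is a minimal sketch of the kind of separation the review is asking for. The class, method, and attribute names are hypothetical placeholders, not vLLM's actual code; only the shape of the control flow is the point.

```python
from torch import nn


class BlockFp8MoEMethodSketch:
    """Illustrative only: one way to keep the block-quantized and
    non-block-quantized paths of process_weights_after_loading separate.
    Names and structure are assumptions, not vLLM's actual classes."""

    def __init__(self, weight_block_size: list[int] | None):
        # e.g. [128, 128] for DeepSeekV3-style checkpoints, or None otherwise.
        self.weight_block_size = weight_block_size

    def process_weights_after_loading(self, layer: nn.Module) -> None:
        # Branch once, up front, so the two paths never interleave and no
        # requantization or scale conversion runs twice.
        if self.weight_block_size is not None:
            self._process_block_quantized(layer)
        else:
            self._process_tensor_or_channel_quantized(layer)

    def _process_block_quantized(self, layer: nn.Module) -> None:
        # Block path: weights already carry one scale per 128x128 block, so
        # only block-scale-specific dtype/layout fix-ups belong here.
        ...

    def _process_tensor_or_channel_quantized(self, layer: nn.Module) -> None:
        # Existing per-tensor / per-channel logic stays untouched here.
        ...
```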
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Redo of #20279
Relies on recent support in compressed-tensors (neuralmagic/compressed-tensors#372) and llm-compressor (vllm-project/llm-compressor#1607) to produce the models.
This PR implements W8A8 FP8 block quantization support for compressed-tensors models, focused on the DeepSeekV3-style format: 128x128 weight blocks and 1x128 activation blocks (effectively per-token-group quantization).
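For readers unfamiliar with the layout, here is a minimal sketch in plain PyTorch (not vLLM's actual kernels) of how the scales are arranged in this format: one scale per 128x128 weight block and one scale per 1x128 activation group.

```python
import torch

BLOCK = 128
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def quantize_weight_blockwise(w: torch.Tensor):
    """Quantize a [N, K] weight to FP8 with one scale per 128x128 block."""
    n, k = w.shape
    blocks = w.reshape(n // BLOCK, BLOCK, k // BLOCK, BLOCK)
    scales = blocks.abs().amax(dim=(1, 3), keepdim=True) / FP8_MAX
    scales = scales.clamp(min=1e-12)
    w_fp8 = (blocks / scales).to(torch.float8_e4m3fn).reshape(n, k)
    return w_fp8, scales.reshape(n // BLOCK, k // BLOCK)  # scales: [N/128, K/128]


def quantize_activation_groupwise(x: torch.Tensor):
    """Quantize [T, K] activations to FP8 with one scale per 1x128 group,
    i.e. per token and per group of 128 channels."""
    t, k = x.shape
    groups = x.reshape(t, k // BLOCK, BLOCK)
    scales = groups.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    scales = scales.clamp(min=1e-12)
    x_fp8 = (groups / scales).to(torch.float8_e4m3fn).reshape(t, k)
    return x_fp8, scales.squeeze(-1)  # scales: [T, K/128]


if __name__ == "__main__":
    w_q, w_s = quantize_weight_blockwise(torch.randn(256, 512))
    x_q, x_s = quantize_activation_groupwise(torch.randn(4, 512))
    print(w_q.shape, w_s.shape, x_q.shape, x_s.shape)
```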
Most of the logic is ported directly from fp8.py, and I hope to eventually refactor the utilities so they can be shared.
Test Plan
Manual testing with newly produced models. I'll add lm-eval in another PR.
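As a rough illustration of the manual check (the model path below is a placeholder, not one of the checkpoints actually tested):

```python
# Hypothetical smoke test: load a compressed-tensors block-FP8 checkpoint in
# vLLM and generate a few tokens. The model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/DeepSeekV3-style-block-fp8-checkpoint")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```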
Test Result
Dense
CT result:
Ref:
MoE
CT result:
Ref: